SmartZoning® Documentation

Leveraging Permitting and Zoning Data to Predict Upzoning Pressure in Philadelphia

Authors

Laura Frances and Nissim Lebovits

Published

December 14, 2023

This model and web application prototype were developed for MUSA508, a Master of Urban Spatial Analytics class focused on predictive public policy analytics at the University of Pennsylvania.

1 Background

Growth is critical for a city to continue to densify and modernize. The benefits of growth range from increased public transit use to a built environment updated for climate resilience. Growth fuels development and vice versa. Philadelphia is the 6th-largest city in the US yet ranks 42nd in cost of living, so growth is often met with concern. Many residents and preservationists ask: Will growth erode the city's best features? Will modernization make the city unaffordable for longtime residents?

Balancing growth with affordability is a precarious task for Philadelphia. To date, politicians have favored making exceptions for developers parcel by parcel rather than championing a citywide smart growth strategy. Zoning advocates need better data-driven tools to broadcast the benefits of smart growth, a planning framework that aims to maximize walkability and transit use and avoid sprawl, and to demonstrate how parcel-by-parcel ("spot") zoning creates unmet development pressure that can drive up costs. Designed to support smart growth advocacy, SmartZoning is a prototype web tool that identifies parcels facing development pressure that conflicts with their current zoning. Users can strategically leverage the tool to promote proactive upzoning of high-priority parcels, aligning current zoning more closely with anticipated development. This approach aims to foster affordable housing in Philadelphia, addressing one of the city's most pressing challenges.

Smart Growth meets SmartZoning®

2 Overview

Below, we outline how we developed a predictive model that forecasts future annual development patterns with a low mean absolute error. This model is the basis of SmartZoning: by anticipating future development, it can highlight where current zoning may hinder smart growth. We also consider the relationship between development pressure and race, income, and housing cost burden, demonstrating the generalizability of our model across different socioeconomic contexts and investigating the impacts of development both locally and citywide.

Show the code
# install_and_load_packages() is a project helper function
required_packages <- c("tidyverse", "sf", "acs", "tidycensus", "sfdep", "kableExtra", "conflicted",
                       "gganimate", "tmap", "gifski", "transformr", "ggpubr", "randomForest", "janitor",
                       "igraph", "plotly", "ggcorrplot", "Kendall", "car", "shiny", "leaflet",
                       "classInt")
suppressWarnings(
  install_and_load_packages(required_packages)
)

source("utils/viz_utils.R")



urls <- c(
  roads = "https://opendata.arcgis.com/datasets/261eeb49dfd44ccb8a4b6a0af830fdc8_0.geojson", # for Broad and Market
  council_dists = "https://opendata.arcgis.com/datasets/9298c2f3fa3241fbb176ff1e84d33360_0.geojson",
  building_permits = building_permits_path,
  permits_bg = final_dataset_path,
  zoning = "https://opendata.arcgis.com/datasets/0bdb0b5f13774c03abf8dc2f1aa01693_0.geojson",
  opa = "data/opa_properties.geojson",
  ols_preds = "data/model_outputs/ols_preds.geojson",
  rf_test_preds = "data/model_outputs/rf_test_preds.geojson",
  rf_val_preds = "data/model_outputs/rf_val_preds.geojson",
  rf_proj_preds = "data/model_outputs/rf_proj_preds.geojson"
)

suppressMessages({
  invisible(
    imap(urls, ~ assign(.y, phl_spat_read(.x), envir = .GlobalEnv))
  )
})

broad_and_market <- roads %>%
  filter(ST_NAME %in% c("BROAD", "MARKET") |
           SEG_ID %in% c(440370, 421347, 421338, 421337, 422413, 423051, 440403, 440402, 440391, 440380))

council_dists <- council_dists %>%
                    select(DISTRICT)

building_permits <- building_permits %>%
                      filter(permittype %in% c("RESIDENTIAL BUILDING", "BP_ADDITION", "BP_NEWCNST"))

3 Feature Selection and Engineering

This study leverages open data sources, including permit counts, council district boundaries, racial mix, median income, and housing cost burden, to holistically understand what drives development pressure. Generally, data are collected at the block group or parcel level and aggregated up to the council district to capture both local and citywide trends.

| Dataset                                    | Source                                       | Geo Level   |
|--------------------------------------------|----------------------------------------------|-------------|
| Construction Permits                       | Philadelphia Dept. of Licenses & Inspections | Parcel      |
| Zoning Base Map                            | Planning Commission                          | Parcel      |
| Zoning Overlays                            | Planning Commission                          | Parcel      |
| Demographic and Socioeconomic Data         | U.S. Census Bureau's ACS 5-Year              | Block Group |
| Council District Boundaries and Leadership | City of Philadelphia                         | Parcel      |

3.1 Construction Permits

Permit data from 2013 through 2023, collected from the Philadelphia Department of Licenses & Inspections, are the basis of our model. We consider only new construction permits granted for residential projects; in the future, filtering for data on "full" or "substantial" renovations could add nuance to the complexities of development pressure. Given the granular spatial scale of our analysis, and the need to aggregate Census data to our unit of analysis, we chose to aggregate these permit data to the block group level.

Construction Permits per Year

Philadelphia, PA

Show the code
tm <- tmap_theme(tm_shape(permits_bg %>% filter(!year %in% c(2012, 2024))) +
        tm_polygons(col = "permits_count", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Permits") +
  tm_facets(along = "year") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE))

suppressMessages(
tmap_animation(tm, "assets/permits_animation.gif", delay = 50)
)

Philadelphia Building Permits, 2013 - 2023
Show the code
ggplot(building_permits %>% filter(!year %in% c(2024)), aes(x = as.factor(year))) +
  geom_bar(fill = palette[1], color = NA, alpha = 0.7) +
  labs(title = "Permits per Year",
       y = "Count") +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        aspect.ratio = .75)

We note a significant uptick in new construction permits approaching 2021, followed by a sharp decline. This trend is generally attributed to the expiration of a tax abatement program for developers.

When assessing new construction permit counts by council district, a few districts issued the bulk of new permits during the 2021 peak. Hover over the lines to see more about the volume of permits and who granted them.

Show the code
perms_x_dist <- st_join(building_permits, council_dists)

perms_x_dist_sum <- perms_x_dist %>%
                  st_drop_geometry() %>%
                  group_by(DISTRICT, year) %>%
                  summarize(permits_count = n())

perms_x_dist_mean <- perms_x_dist_sum %>%
                      group_by(year) %>%
                      summarize(permits_count = mean(permits_count)) %>%
                      mutate(DISTRICT = "Average")

perms_x_dist_sum <- bind_rows(perms_x_dist_sum, perms_x_dist_mean) %>%
                        mutate(color = ifelse(DISTRICT != "Average", 0, 1))

ggplotly(
ggplot(perms_x_dist_sum %>% filter(year > 2013, year < 2024), aes(x = year, y = permits_count, color = as.character(color), group = interaction(DISTRICT, color))) +
  geom_line(lwd = 0.7) +
  labs(title = "Permits per Year by Council District",
       y = "Total Permits") +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
  scale_color_manual(values = c(palette[5], palette[1]))
)

3.2 Spatio-Temporal Features

New construction exhibits sizable spatial and temporal autocorrelation. In other words, there is a strong relationship between the number of permits in a given block group and the number in neighboring block groups, as well as between the number of permits issued in a block group in a given year and the number issued in that same block group the previous year. To account for these relationships, we engineer new features, including both space and time lags. All of these engineered features have strong correlation coefficients with our dependent variable, permits_count, and p-values indicating that these relationships are statistically significant.
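To make the feature engineering concrete, the sketch below (illustrative, not the exact upstream pipeline) shows how a one-year time lag and its spatial counterpart could be built with dplyr and the same sfdep helpers (st_contiguity(), st_weights(), st_lag()) used later in this analysis; column names follow the dataset loaded above.

Show the code
# Hedged sketch of the lag feature engineering; assumes permits_bg is an sf
# data frame with one row per block group (geoid10) per year.
library(dplyr)
library(sf)
library(sfdep)

permits_bg_feats <- permits_bg %>%
  group_by(geoid10) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(lag_1_year = lag(permits_count)) %>%  # same block group, prior year
  ungroup() %>%
  group_by(year) %>%                           # one cross-section per year
  mutate(nb = st_contiguity(geometry),         # queen-contiguity neighbors
         wt = st_weights(nb),                  # row-standardized weights
         lag_spat_1_year = st_lag(permits_count, nb, wt)) %>%  # neighbors' weighted mean
  ungroup()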

Show the code
permits_bg_long <- permits_bg %>%
                    filter(!year %in% c(2024)) %>%
                    st_drop_geometry() %>%
                    pivot_longer(
                      cols = c(starts_with("lag")),
                      names_to = "Variable",
                      values_to = "Value"
                    )


ggscatter(permits_bg_long, x = "permits_count", y = "Value", facet.by = "Variable",
   add = "reg.line",
   add.params = list(color = palette[3]),
   conf.int = TRUE, alpha = 0.2
   ) + 
  stat_cor(method = "pearson", p.accuracy = 0.001, r.accuracy = 0.01, size = 3) +
  labs(title = "Correlation of `permits_count` and Engineered Features",
       x = "Value",
       y = "Permits Count") +
  theme_minimal()

3.3 Socioeconomic Features

Socioeconomic factors such as race, income, and housing cost burden play an outsized role in affordability in cities like Philadelphia, which are marred by a pervasive and persistent history of housing discrimination and systemic disinvestment in poor and minority neighborhoods. To account for these issues, we incorporate data from the US Census Bureau's American Community Survey (ACS) 5-Year estimates. Later, we also consider our model's generalizability across different racial and economic contexts to ensure that it will not inadvertently reinforce structural inequity.
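The ACS pull itself happens upstream of this document, but a minimal sketch with tidycensus (loaded above) might look like the following. The variable codes and names are illustrative assumptions, not the project's exact query, and a Census API key is assumed to be configured.

Show the code
# Illustrative sketch only: fetch block-group ACS 5-Year estimates for
# Philadelphia. Variable codes are assumptions for illustration
# (B01003_001 = total population, B19013_001 = median household income).
library(tidycensus)

acs_bg <- get_acs(
  geography = "block group",
  state = "PA",
  county = "Philadelphia",
  year = 2022,
  survey = "acs5",
  variables = c(total_pop = "B01003_001",
                med_inc = "B19013_001"),
  output = "wide"
)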

Spatially, it is clear that nonwhite communities earn lower median incomes and experience higher rates of extreme rent burden (a household spending more than 35% of its income on gross rent).

Show the code
med_inc <- tmap_theme(tm_shape(permits_bg %>% filter(year == 2022)) +
        tm_polygons(col = "med_inc", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Med. Inc. ($)") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE),
  "Median Income")
  
race <- tmap_theme(tm_shape(permits_bg %>% filter(year == 2022)) +
        tm_polygons(col = "percent_nonwhite", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Nonwhite (%)") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE),
  "Race")
  
rent_burd <- tmap_theme(tm_shape(permits_bg %>% filter(year == 2022)) +
        tm_polygons(col = "ext_rent_burden", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Rent Burden (%)") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE),
  "Extreme Rent Burden")
  
tmap_arrange(med_inc, race, rent_burd)

4 Model Building

“All the complaints about City zoning regulations really boil down to the fact that City Council has suppressed infill housing or restricted multi-family uses, which has served to push average housing costs higher.” - Jon Geeting, Philly 3.0 Engagement Director

SmartZoning® seeks to predict where permits are most likely to be filed as a measure to predict urban growth. As discussed, predicting growth is fraught because growth is influenced by political forces rather than by plans published by the city’s Planning Commission. Comprehensive plans, typically set on ten-year timelines, tend to become mere suggestions, ultimately subject to the prerogatives of city council members rather than serving as steadfast guides for smart growth. With these dynamics in mind, SmartZoning’s prediction model accounts for socioeconomics, council district, and time-space lag.

4.1 Tests for Correlation and Collinearity

4.1.1 Correlation Coefficients

In building our model, we aim to select variables that correlate significantly with permits_count. Using a correlation matrix, we can assess whether our predictors are, in fact, meaningfully associated with our dependent variable. As it turns out, the socioeconomic variables are not (we exclude the other variables here, having previously established them to be significant), but we retain them for the sake of later analysis.

Show the code
corr_vars <- c("total_pop",
               "med_inc",
               "percent_nonwhite",
               "percent_renters",
               "rent_burden",
               "ext_rent_burden")

corr_dat <- permits_bg %>%
  select(all_of(corr_vars), permits_count) %>%
  select(where(is.numeric)) %>%
  st_drop_geometry() %>%
  unique() %>%
  na.omit()

corr <- round(cor(corr_dat), 2)
p.mat <- cor_pmat(corr_dat)

ggcorrplot(corr, p.mat = p.mat, hc.order = FALSE,
    type = "full", insig = "blank", lab = TRUE, colors = c(palette[2], "white", palette[3])) +
  annotate(
  geom = "rect",
  xmin = .5, xmax = 7.5, ymin = 6.5, ymax = 7.5,
  fill = "transparent", color = "red", alpha = 0.5
)

4.1.2 VIF

We also aim to minimize or eliminate multicollinearity in our model. For this purpose, we evaluate the variance inflation factor (VIF) of each predictor. The table below lists the VIF of all of our predictors; we exclude any with a VIF of 5 or more from our final model, including district (council district) and several historic district and planning overlays.

Show the code
ols <- lm(permits_count ~ .,
          data = permits_bg %>%
            filter(year < 2024) %>%
            select(-c(mapname, geoid10, year)) %>%
            st_drop_geometry())
vif(ols) %>%
  data.frame() %>%
  clean_names() %>%
  select(-df) %>%
  arrange(desc(gvif)) %>%
  kablerize()
variable gvif gvif_1_2_df
district 1175997.928011 2.173926
hist_dist_na 33.120644 5.755054
hist_dist_historic_street_paving_thematic_district 26.735386 5.170627
overlay_fne 15.601603 3.949886
overlay_ne 11.179327 3.343550
overlay_nis 7.717070 2.777961
overlay_ndo 6.867713 2.620632
overlay_fdo 6.574022 2.563985
overlay_edo 5.595256 2.365429
overlay_vdo 5.400210 2.323835
dist_to_transit 4.378395 2.092462
lag_spat_4_years 3.946301 1.986530
lag_spat_2_years 3.767694 1.941055
lag_spat_5_years 3.722626 1.929411
lag_spat_6_years 3.689699 1.920859
lag_spat_3_years 3.620857 1.902855
lag_spat_7_years 3.581870 1.892583
lag_spat_8_years 3.530726 1.879023
percent_nonwhite 3.147111 1.774010
lag_spat_1_year 2.943444 1.715647
overlay_ctr 2.938630 1.714243
med_inc 2.826465 1.681210
lag_spat_9_years 2.645426 1.626477
hist_dist_rittenhouse_fitler_residential 2.310139 1.519914
overlay_ahc 2.264476 1.504818
lag_4_years 2.260224 1.503404
overlay_min 2.184365 1.477960
hist_dist_spring_garden 2.178088 1.475835
hist_dist_diamond_street 2.126102 1.458116
lag_5_years 2.040912 1.428605
lag_2_years 2.012640 1.418676
overlay_ncp 1.990799 1.410957
hist_dist_girard_estate 1.933372 1.390457
overlay_wwo 1.931780 1.389885
hist_dist_overbrook_farms_historic_district 1.929129 1.388931
lag_6_years 1.896103 1.376991
lag_7_years 1.878892 1.370727
lag_8_years 1.873276 1.368677
overlay_other 1.865672 1.365896
hist_dist_old_city 1.839962 1.356452
lag_3_years 1.836923 1.355331
overlay_eco 1.800232 1.341727
overlay_nca 1.717870 1.310675
hist_dist_awbury_arboretum 1.690225 1.300086
hist_dist_park_mall_temple_universitys_campus 1.683083 1.297337
lag_1_year 1.680436 1.296316
dist_to_2022 1.654760 1.286375
lag_9_years 1.646576 1.283190
overlay_wst 1.616137 1.271274
overlay_nbo 1.611118 1.269298
percent_renters 1.602037 1.265716
overlay_tso 1.462620 1.209388
overlay_ima 1.453865 1.205763
ext_rent_burden 1.444448 1.201852
overlay_yod 1.381172 1.175233
overlay_hhc 1.379881 1.174683
overlay_nco 1.379048 1.174329
hist_dist_parkside 1.375646 1.172880
hist_dist_society_hill 1.348453 1.161229
overlay_tod 1.345925 1.160140
overlay_drc 1.313695 1.146165
overlay_gao 1.307361 1.143399
overlay_wwa 1.299857 1.140113
overlay_ued 1.299239 1.139842
overlay_na 1.287432 1.134651
hist_dist_tudor_east_falls 1.285346 1.133731
overlay_eod 1.283286 1.132822
rent_burden 1.268730 1.126379
overlay_cdo 1.266165 1.125240
hist_dist_league_island_park_aka_f_d_r_park 1.239856 1.113488
hist_dist_manayunk_main_street_historic_district 1.232682 1.110262
total_pop 1.215123 1.102326
overlay_nho 1.194565 1.092961
overlay_cgc 1.189653 1.090712
hist_dist_greenbelt_knoll 1.184422 1.088311
overlay_snm 1.181435 1.086938
overlay_cao 1.179384 1.085995
overlay_ame 1.156756 1.075526
overlay_ahp 1.149727 1.072253
overlay_stm 1.137920 1.066734
overlay_smh 1.135161 1.065439
overlay_env 1.035448 1.017570
overlay_wah 1.031113 1.015437
hist_dist_420_row 1.025286 1.012564
hist_dist_east_logan_street 1.022054 1.010967


Notably, permits_count does not have a particularly strong correlation with any of our selected variables. This might suggest that permits are evenly distributed throughout the city. However, as we can see below, there are few block groups with more than 50 permits. This indicates that permits are granted on a block-by-block basis across all districts; the need for SmartZoning applies to most Philadelphia neighborhoods, not just a select few.

Show the code
ggplot(permits_bg %>% st_drop_geometry %>% filter(!year %in% c(2024)), aes(x = permits_count)) +
  geom_histogram(fill = palette[1], color = NA, alpha = 0.7) +
  labs(title = "Permits per Block Group per Year",
       subtitle = "Log-Transformed",
       y = "Count") +
  scale_x_log10() +
  facet_wrap(~year) +
  theme_minimal() +
  theme(axis.title.x = element_blank())

4.2 Spatial Patterns

In addition to correlation between non-spatial variables, our dependent variable, permits_count, displays a high degree of spatial autocorrelation. That is, the number of permits at a given location is closely related to the number of permits at neighboring locations. We account for this in our model by factoring in spatial lag, and we explore it here by evaluating local Moran's I, a measure of how concentrated high or low values are at a given location. We identify hotspots for new construction in 2023 by looking at statistically significant concentrations of new building permits.

Show the code
lisa <- permits_bg %>% 
  filter(year == 2023) %>%
  mutate(nb = st_contiguity(geometry),
                         wt = st_weights(nb),
                         permits_lag = st_lag(permits_count, nb, wt),
          moran = local_moran(permits_count, nb, wt)) %>% 
  tidyr::unnest(moran) %>% 
  mutate(pysal = ifelse(p_folded_sim <= 0.1, as.character(pysal), NA),
         hotspot = case_when(
           pysal == "High-High" ~ "Yes",
           TRUE ~ "No"
         ))


morans_i <- tmap_theme(tm_shape(lisa) +
  tm_polygons(col = "ii", border.alpha = 0, style = "jenks", palette = mono_5_green, title = "Moran's I"),
  "Local Moran's I (2023)")

p_value <- tmap_theme(tm_shape(lisa) +
  tm_polygons(col = "p_ii", border.alpha = 0, style = "jenks", palette = mono_5_green, title = "P-Value"),
  "Moran's I P-Value (2023)")

sig_hotspots <- tmap_theme(tm_shape(lisa) +
  tm_polygons(col = "hotspot", border.alpha = 0, style = "cat", palette = c(mono_5_green[1], mono_5_green[5]), textNA = "Not a Hotspot", title = "Hotspot?"),
  "Construction Hotspots (2023)")

tmap_arrange(morans_i, p_value, sig_hotspots, ncol = 3)

Beyond a single year's snapshot, we can also test for emerging hotspots: block groups whose permit counts show statistically significant trends over time.

Show the code
# Prepare the data
permits_data <- permits_bg %>%
  filter(year < 2024,
         year > 2012) %>%
  select(permits_count, geoid10, year) %>%
  na.omit()

# Create spacetime object
stc <- as_spacetime(permits_data,
                    .loc_col = "geoid10",
                    .time_col = "year")

# Run emerging hotspot analysis
ehsa <- emerging_hotspot_analysis(
  x = stc,
  .var = "permits_count",
  k = 1,
  nsim = 25
)

# Analyze the result
count(ehsa, classification)
       classification    n
1 no pattern detected 1336

With k = 1 and 25 simulations, the analysis classifies all 1,336 block-group series as showing no detectable emerging pattern; the lag features described above remain our primary way of capturing spatio-temporal structure.

4.3 Compare Models

To actually build our model, we have a range of approaches to choose from. Ordinary least squares (OLS) regression is among the most common; here, we use it as the baseline for comparison with a random forest model, which is somewhat more sophisticated. We also considered a Poisson model, but found that it drastically overpredicted for outliers, so we discarded it. As a point of comparison, we built both OLS and random forest models, trained them on data from 2013 through 2021, tested them on 2022 data, and compared the results for accuracy, overfitting, and generalizability.
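The fitted models themselves are produced upstream, and their predictions are loaded above (ols_preds, rf_test_preds); a minimal sketch of the comparison, assuming permits_bg carries the engineered features and that categorical columns are already factors, might look like this.

Show the code
# Hedged sketch of the train/test comparison; not the exact upstream pipeline.
library(dplyr)
library(sf)
library(randomForest)

model_dat <- permits_bg %>%
  st_drop_geometry() %>%
  select(-c(mapname, geoid10)) %>%  # drop identifiers, as in the VIF model above
  na.omit()

train <- model_dat %>% filter(year <= 2021) %>% select(-year)
test  <- model_dat %>% filter(year == 2022) %>% select(-year)

# Fit both models on 2013-2021, then score each on 2022
ols_fit <- lm(permits_count ~ ., data = train)
rf_fit  <- randomForest(permits_count ~ ., data = train)

mae <- function(actual, predicted) mean(abs(actual - predicted))
mae(test$permits_count, predict(ols_fit, test))
mae(test$permits_count, predict(rf_fit, test))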

4.3.1 OLS

OLS (ordinary least squares) is a method for exploring relationships between a dependent variable and one or more explanatory variables, considering the strength and direction of those relationships and the goodness of model fit. Our model incorporates three engineered groups of features: space lag, time lag, and distance to 2022. We include this last variable because of the Philadelphia tax abatement policy, discussed earlier, that led to a significant increase in residential development in the years immediately before 2022. We used this as a baseline model against which to compare the Poisson and random forest models.

Overall, we found that our basic OLS model performed quite well; with a mean absolute error (MAE) of 2.68, it is fairly accurate in predicting future development. We also note that it overpredicts in most cases, which, given our goal of anticipating and preparing for high demand for future development, is preferable to underpredicting. That said, it still produces a handful of outliers that deviate substantially from the predicted value. As a result, we considered a random forest model to see if it would handle these outliers better.

Show the code
suppressMessages(
ggplot(ols_preds, aes(x = permits_count, y = ols_preds)) +
  geom_point(alpha = 0.2) +
  labs(title = "Predicted vs. Actual Permits: OLS",
       subtitle = "2022 Data",
       x = "Actual Permits",
       y = "Predicted Permits") +
  geom_abline() +
  geom_smooth(method = "lm", se = FALSE, color = palette[3]) +
  theme_minimal()
)

Show the code
ols_preds_map <- tmap_theme(tm_shape(ols_preds) +
        tm_polygons(col = "ols_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Predicted Permits: OLS")

ols_error_map <- tmap_theme(tm_shape(ols_preds) +
        tm_polygons(col = "abs_error", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Absolute Error") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Absolute Error: OLS")

tmap_arrange(ols_preds_map, ols_error_map)

4.3.2 Random Forest

Random forest models improve on OLS in their ability to capture non-linear patterns and handle outliers, and they tend to be less sensitive to multicollinearity. Thus, we considered whether a random forest model would address some of the weaknesses of the OLS model. The random forest model yielded an MAE of 2.91, slightly higher than the OLS baseline, but the range of absolute error was sizably reduced, with outliers exerting less of an impact on the model.

Show the code
suppressMessages(
ggplot(rf_test_preds, aes(x = permits_count, y = rf_test_preds)) +
  geom_point(alpha = 0.2) +
  labs(title = "Predicted vs. Actual Permits: RF",
       subtitle = "2022 Data",
       x = "Actual Permits",
       y = "Predicted Permits") +
  geom_abline() +
  geom_smooth(method = "lm", se = FALSE, color = palette[3]) +
  theme_minimal()
)

Show the code
test_preds_map <- tmap_theme(tm_shape(rf_test_preds) +
        tm_polygons(col = "rf_test_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Predicted Permits: RF Test")

test_error_map <- tmap_theme(tm_shape(rf_test_preds) +
        tm_polygons(col = "abs_error", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Absolute Error") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Absolute Error: RF Test") 

tmap_arrange(test_preds_map, test_error_map)

5 Model Testing

Model training, validation, and testing involved three steps. First, we partitioned our data into training, validation, and testing sets. We used data from 2013 through 2021 for initial model training. Next, we evaluated our models’ ability to accurately predict 2022 construction permits using our validation set, which consisted of all permits in 2022. We carried out additional feature engineering and model tuning, iterating based on the results of these training and testing splits. We sought to minimize both the mean absolute error (MAE) of our best model and the distribution of absolute error. Finally, when we were satisfied with the results of our best model, we evaluated it again by training it on all data from 2013 through 2022 and validating it on data from 2023 (all but the last two weeks, which we consider negligible for our purposes), which the model had never “seen” before. As Kuhn and Johnson write in Applied Predictive Modeling (2013), “Ideally, the model should be evaluated on samples that were not used to build or fine-tune the model, so that they provide an unbiased sense of model effectiveness.”
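Continuing the hedged sketch from the model comparison above, the final hold-out step refits the chosen random forest on 2013 through 2022 and scores it on the never-seen 2023 data.

Show the code
# Final hold-out evaluation, continuing the illustrative sketch above
train_full <- model_dat %>% filter(year <= 2022) %>% select(-year)
val_2023   <- model_dat %>% filter(year == 2023) %>% select(-year)

rf_final <- randomForest(permits_count ~ ., data = train_full)
mae(val_2023$permits_count, predict(rf_final, val_2023))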

Again, testing confirms the strength of our model; on 2023 data, our random forest model produces an MAE of 2.19. We note again that the range of model error is relatively narrow. Generally, where the model predicts more permits, we also see higher error; spatially, absolute errors cluster in a handful of block groups with high permit counts.

Show the code
val_preds_map <- tmap_theme(tm_shape(rf_val_preds) +
        tm_polygons(col = "rf_val_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Predicted Permits: RF Validate")

val_error_map <- tmap_theme(tm_shape(rf_val_preds) +
        tm_polygons(col = "abs_error", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Absolute Error") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Absolute Error: RF Validate")

tmap_arrange(val_preds_map, val_error_map)

Show the code
suppressMessages(
ggplot(rf_val_preds, aes(x = permits_count, y = rf_val_preds)) +
  geom_point(alpha = 0.2) +
  labs(title = "Predicted vs. Actual Permits: RF",
       subtitle = "2023 Data",
       x = "Actual Permits",
       y = "Predicted Permits") +
  geom_abline() +
  geom_smooth(method = "lm", se = FALSE, color = palette[3]) +
  theme_minimal()
)

When comparing the spread of predicted versus actual permits, there is more divergence at higher permit counts, which is also observed spatially. Throughout this study, we have hedged against any model's sensitivity to outliers, and while there is noticeable variability in prediction errors at high permit counts, we remain confident in our model selection.

6 Generalizability

Comparing model error per block group to several sociodemographic characteristics indicates that the model generalizes well; it displays consistent and relatively low absolute errors across income, racial mix, rent burden, and total population. Error is not correlated with affordability and in fact is slightly negatively correlated with the percentage of the nonwhite population. This observed pattern may be attributed to the likelihood that majority-minority neighborhoods experience a comparatively lower volume of overall development, thereby diminishing the absolute magnitude of error, despite potential proportional increases. Additionally, there is a slight increase in error with the total population, aligning with the expectation that higher population figures correspond to more extensive development activities.

Show the code
rf_val_preds_long <- rf_val_preds %>%
  pivot_longer(cols = c(rent_burden, percent_nonwhite, total_pop, med_inc),
               names_to = "variable", values_to = "value") %>%
  mutate(variable = case_when(
    variable == "med_inc" ~ "Median Income ($)",
    variable == "percent_nonwhite" ~ "Nonwhite (%)",
    variable == "rent_burden" ~ "Rent Burden (%)",
    TRUE ~ "Total Pop."
  ))

ggplot(rf_val_preds_long, aes(x = value, y = abs_error)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, color = palette[3]) +
  facet_wrap(~ variable, scales = "free_x") +
  labs(title = "Generalizability of Absolute Error",
       x = "Value",
       y = "Absolute Error") +
  theme_minimal()

Considering the reign of councilmanic prerogative in Philadelphia, briefly discussed in the Background, it is important to ask whether the model is generalizable across the city's 10 council districts. Consistent with the error distributions discussed previously, we note that the council districts with the highest volume of permits also have the widest distribution of errors. That said, the overall distribution of error per district remains narrow; no district has a mean error above 3, and only District 5 has an upper-quartile error above 5. Evidently, a handful of outliers drive up the MAE, but model error is well-distributed overall.

Show the code
suppressMessages(
  ggplot(rf_val_preds, aes(x = reorder(district, abs_error, FUN = mean), y = abs_error)) +
    geom_boxplot(fill = NA, color = palette[3], alpha = 0.7) +
    labs(title = "MAE by Council District",
         y = "Mean Absolute Error",
         x = "Council District") +
    theme_minimal()
)

7 Assessing Upzoning Pressure

The core functionality of the SmartZoning tool is to identify conflict between projected development and current zoning. Notably, RSA4 and RSA5 (Residential Single-Family Attached) zoning is also where we predict the most development pressure. When we spoke with Jon Geeting, Engagement Director of Philadelphia 3.0, he noted that RSA4 and RSA5 are two of the most prohibitive zoning designations, especially in areas close to amenities like schools, grocery stores, and, most of all, public transit.

Show the code
filtered_zoning <- zoning %>%
                     filter(str_detect(CODE, "RS") | str_detect(CODE, "I"),
                            CODE != "I2",
                            !str_detect(CODE, "SP")) %>%
                     st_join(., rf_val_preds %>% select(rf_val_preds))


zoning_map <- tmap_theme(tm_shape(filtered_zoning) +
        tm_polygons(col = "CODE", border.alpha = 0, colorNA = "lightgrey", title = "Zoning Code", palette = zoning_palette) +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE,
            legend.height = 0.4),
  "Restrictive Zoning")
  
mismatch <- tmap_theme(tm_shape(filtered_zoning) +
        tm_polygons(col = "rf_val_preds", border.alpha = 0, colorNA = "lightgrey", palette = mono_5_orange, style = "fisher", title = "Predicted New Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Development Pressure")

tmap_arrange(zoning_map, mismatch)

Using these data, we can filter for properties that 1) are zoned restrictively and 2) face high development pressure. Furthermore, we can identify properties with high potential for assemblage, which suggests the ability to accommodate high-density, multi-unit housing. Because developers seek out assemblage opportunities to create more sizable and therefore more economically viable projects, assemblage also drives development pressure.

Show the code
tmap_mode('view')

filtered_zoning %>%
  filter(rf_val_preds > 10) %>%
tm_shape() +
        tm_polygons(col = "CODE", border.alpha = 0, colorNA = "lightgrey",
                    popup.vars = c('rf_val_preds', 'CODE'), palette = zoning_palette) +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE)
Show the code
nbs <- filtered_zoning %>% 
  mutate(nb = st_contiguity(geometry))

# Create edge list while handling cases with no neighbors
edge_list <- tibble::tibble(id = 1:length(nbs$nb), nbs = nbs$nb) %>% 
  unnest(nbs) %>% 
  filter(nbs != 0)

# Create a graph with a node for each row in filtered_zoning
g <- make_empty_graph(n = nrow(filtered_zoning))
V(g)$name <- as.character(1:nrow(filtered_zoning))

# Add edges if they exist
if (nrow(edge_list) > 0) {
  edges <- as.matrix(edge_list)
  g <- add_edges(g, c(t(edges)))
}

# Calculate the number of contiguous neighbors and sum of contiguous areas
n_contiguous <- numeric(nrow(filtered_zoning))
sum_contig_area <- numeric(nrow(filtered_zoning))

for (i in 1:nrow(filtered_zoning)) {
  neighbors <- neighborhood(g, order = 1, nodes = i)[[1]]
  # Exclude the node itself from its list of neighbors
  neighbors <- neighbors[neighbors != i]
  n_contiguous[i] <- length(neighbors)
  sum_contig_area[i] <- sum(filtered_zoning$Shape__Area[neighbors], na.rm = TRUE)
}

contig_info <- data.frame(n_contig = unlist(n_contiguous), sum_contig_area = unlist(sum_contig_area))
filtered_zoning <- cbind(filtered_zoning, contig_info)


filtered_zoning %>%
  st_drop_geometry() %>%
  select(rf_val_preds,
         n_contig,
         sum_contig_area,
         CODE) %>%
  filter(rf_val_preds > 10,
         n_contig > 2) %>%
  arrange(desc(rf_val_preds)) %>%
  kablerize(caption = "Poorly-Zoned Properties with High Development Risk")
Poorly-Zoned Properties with High Development Risk
row rf_val_preds n_contig sum_contig_area CODE
9169 27.94353 3 7749.422 ICMX
12326 17.81977 3 11550.410 RSA5
4512 14.36920 4 476771.859 IRMX
6014 14.36920 5 478421.629 ICMX
2138 11.07510 3 11621.492 IRMX
9409 11.04757 3 84259.828 RSA2
5108 10.92453 3 5743.941 IRMX
1536 10.63863 3 3129719.988 ICMX
2422 10.63863 4 3390870.230 IRMX
2941 10.63863 3 3129719.988 RSA5
10810 10.63863 4 3135053.539 ICMX
11135.1 10.63863 7 2753612.852 I3

8 2024 Predictions

Using the trained and tested model, we predict where new construction permits are most likely to be filed and granted in 2024. Interestingly, when comparing the predicted map with the distribution of median income, development pressure is expected to be greatest in block groups with higher median incomes. This relationship suggests that developers may be strategically targeting areas with greater economic prosperity to meet market demand, de-risk return on investment, and tap into more robust infrastructure and local services. This is a key insight that zoning advocates can leverage to push for smart growth policies by emphasizing the need for more equitable development policies, incentives for affordable housing, and more meaningful transit-oriented development programs.

Show the code
tmap_mode('plot')

preds24 <- tmap_theme(tm_shape(rf_proj_preds) +
        tm_polygons(col = "rf_proj_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Predicted New Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Projected New Development, 2024")

tmap_arrange(preds24, med_inc)
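The projections mapped above (rf_proj_preds) were generated upstream. Mechanically, the projection step amounts to building a 2024 feature frame whose lag features are shifted forward from observed counts and scoring it with the fitted model; the simplified sketch below, continuing the assumptions above, shifts only the one-year lags for illustration.

Show the code
# Simplified sketch of the projection step: only the one-year time and
# spatial lags are shifted here; the real pipeline shifts every lag feature.
proj_2024 <- permits_bg %>%
  filter(year == 2023) %>%             # latest observed cross-section
  mutate(year = 2024,
         lag_1_year = permits_count,   # 2023 counts become the one-year lag
         nb = st_contiguity(geometry),
         wt = st_weights(nb),
         lag_spat_1_year = st_lag(permits_count, nb, wt))

proj_2024$rf_proj_preds <- predict(rf_final,
                                   newdata = proj_2024 %>%
                                     st_drop_geometry() %>%
                                     select(-nb, -wt))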

9 Web Application

Here is a preview of the SmartZoning web application. The UX offers key features that leverage this study's modeling and mapping.

Key Features:

  • Interactive parcel-level map that makes key information about each parcel easily accessible
  • Filters to enhance spatial exploration, customizable to different audiences
  • View switch between current development and future development (v2 could include historical permit and zoning data for powerful comparatives)
  • Compare functionality to understand neighborhood context
  • Customizable report generator to instantly create actionable assets

The tool is currently in beta testing. The goal is to optimize it for the use case of a smart growth or rezoning advocate. Most critical is prioritizing the buttons and filters so that advocates can use the tool as a real-time accessory in negotiations with council members, planning commissions, and other key stakeholders.

The intended impact is to provide advocates with data-driven insights that can eloquently and efficiently make the case for streamlining smart growth policies. By providing an unprecedented combination of predictions and supplementary data, the SmartZoning® tool will enable users to proactively engage decision makers. Stay tuned for more updates.

10 Next Steps

SmartZoning® emphasizes the importance of data-driven decision-making in zoning. Utilizing parcel-level data to comprehensively analyze the whole city helps identify areas with potential for growth and guide zoning decisions that prioritize a balanced and equitable distribution of development opportunities.

In future iterations, the model could focus on the potential for upzoning in areas where no permits are predicted but incomes are high, such as Center City and Society Hill. To address the concern that highlighting where upzoning pressure exists today could exacerbate gentrification trends in neighborhoods like Fairmount and Grays Ferry, the model could be calibrated to flag development mismatch in underdeveloped but high-income areas. The goal is a balanced and inclusive approach to development that benefits the existing community, preserves neighborhood character, and provides opportunities for all residents; SmartZoning® can help inform how this balance is struck.